Analysis of 2020 Presidential Election

My initial reason for conducting this analysis was to learn more about using ggplot2 to generate choropleth maps and, more generally, to improve my visualization skills. ggplot2 is a great tool, but it takes considerable knowledge to use easily and effectively.

As I started working through the analysis, I realized that the biggest challenge would be the data munging effort required. In many R-library learning exercises the data is pristine and well suited to the example presented. In the real world, however, even data from a solid, reputable source may be inconsistent, mislabeled and ill-formatted. Documentation about the metadata is often sparse or missing.

For this analysis, I collected two data sets provided by the MIT Election Data and Science Lab. One set contains the US presidential election results by state and the other contains the results for all US counties.

Cite: MIT Election Data and Science Lab, 2017, “U.S. President 1976–2020”, https://doi.org/10.7910/DVN/42MVDX, Harvard Dataverse, V6, UNF:6:4KoNz9KgTkXy0ZBxJ9ZkOw== [fileUNF]

Step one of the analysis

Collect the geospatial data necessary to produce choropleth maps

The geospatial tables we need are available through ggplot2's map_data() function, which reads them from the maps package. We will start with the “state” table and later move on to the “county” table.

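library(tidyverse)  # ggplot2, dplyr, tidyr, readr; map_data() also needs the maps package installed
usa_tbl <- map_data("state") %>% as_tibble()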

Generate a US map of states

After loading the US map table, we can use the ggplot2 geom_map function to display our map. The plotted map shows the contiguous US states, outlined in grey.

We also add the coord_map function to draw the map in a proper projection.

usa_tbl %>% ggplot(aes(long, lat, map_id = region)) +
  geom_map(
    map = usa_tbl,
    color = "grey80", fill = "grey30", size = 0.3
  ) +
  coord_map("ortho", orientation = c(19,-98,0))

Create a US map of the counties

Now we will perform the same type of operation in ggplot2 with county-level geospatial data for the USA.

map_data() provides this detail through the “county” table.

Since the county data does not have a FIPS identifier, I will have to create a state-county identifier to make sure the county map data has a unique key. Just using a county name as identifier will not work since many counties have the same name across different states. This was the beginning of the data munging effort.

# make a state-county identifier
county_usa_tbl <- map_data("county") %>% as_tibble() %>%
  unite(countyID, region, subregion, sep = "_", remove = FALSE) 
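To make the need for a combined key concrete, here is a minimal sketch on a toy table. The three rows and the names toy_counties and toy_keyed are made up for illustration and are not taken from the map data.

```r
library(dplyr)
library(tidyr)

# Toy rows (illustrative only): two states share the county name "clay"
toy_counties <- tibble(
  region    = c("georgia", "iowa", "iowa"),
  subregion = c("clay", "clay", "polk")
)

# The county name alone is ambiguous: three rows, but only two distinct names
n_distinct(toy_counties$subregion)

# Combining state and county yields one distinct key per row
toy_keyed <- toy_counties %>%
  unite(countyID, region, subregion, sep = "_", remove = FALSE)

toy_keyed$countyID
```

Because remove = FALSE keeps region and subregion, the original columns stay available for later filtering.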

Generate a US county map

The map uses the ggplot2 geom_map function and shows counties outlined in grey. Here again I added the coord_map function to draw the map in a proper projection.

county_usa_tbl %>%
  ggplot(aes(long, lat, map_id = region)) +
  geom_map(
    map = county_usa_tbl,
    color = "grey80", fill = "grey30", size = 0.3
  ) +
  coord_map("ortho", orientation = c(19,-98,0))

Now we will get the raw data we need for the election analysis.

Collect the data from Harvard Dataverse (MIT Election Data and Science Lab)

Reshaping the presidential data

# thin data to 2020 Presidential data
election_results <- read_csv("voting_data/1976-2020-president.csv", col_names = TRUE) %>%
  filter(year == "2020") %>%
  mutate(pct_votes = as.numeric(round(((candidatevotes/totalvotes)*100), digits = 3 ))) %>%
  rename(region = state) %>%
  mutate(region = tolower(region)) 

# narrow to just voting percentage by Republicans by state
# (columns are picked by position: expected to be region, party_simplified, pct_votes)
republican_voting <- election_results %>%
  filter(party_simplified == "REPUBLICAN") %>%
  select(2, 15:16)

Create a state view of the voting percentages

usa_republican_tbl <- usa_tbl %>%
  left_join(republican_voting, by = "region")

usa_republican_tbl

Create a choropleth in ggplot of the Republican state percentage of votes

usa_republican_tbl
# create a subregion to categorize choropleths


usa_republican_tbl %>%
  ggplot(aes(long, lat, group = region)) +
  geom_map(
    aes(map_id = region),
    map = usa_tbl,
    color = "gray80", fill = "gray30", size = 0.3
    ) +
  coord_map("ortho", orientation = c(38, -98, 0)) +
  geom_polygon(aes(group = group, fill = pct_votes), color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 50) +
  theme_minimal() +
  labs(title = "Republican State Voting Percentages in 2020",
       x = "", y = "", fill = ""
       ) +
  theme(
    plot.title = element_text(size = 20, face = "bold", color = "red"),
    legend.position = "bottom"
  )

This shows the Republican share of the vote by state. White areas reflect very close election results.

Let’s look at a comparison of States voting Republican vs Democrat

The data is filtered to those states with very close margins, based on voting percentages.

# divide state data into Republican and Democrat states
major_party_voting <- election_results %>%
  select(2:4, 11:12, 15:16) %>%
  filter(party_simplified == "REPUBLICAN" | party_simplified == "DEMOCRAT") %>%
  filter(pct_votes > 47 & pct_votes < 52)
  
major_party_voting

# Grouped bar chart of the close states
ggplot(major_party_voting, aes(fill = party_simplified, y = pct_votes, x = state_po)) +
  geom_bar(position = "dodge", stat = "identity") +
  # name the colors so they do not depend on the order of the party values
  scale_fill_manual(values = c(DEMOCRAT = "blue", REPUBLICAN = "red")) +
  geom_text(aes(label = pct_votes), vjust = -0.2, size = 3,
            position = position_dodge(0.9)) +
  coord_cartesian(ylim = c(47, 51)) +
  labs(title = "Major Party Presidential Voting",
       subtitle = "Presidential Data",
       caption = "Source: Presidential Election Data",
       x = "State", y = "Percent Votes") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6),
        panel.grid = element_line(color = "grey", size = 0.75, linetype = 2))

The above chart illustrates the close voting margins for several states.

It essentially shows the white states on the choropleth map.

County Level Analysis

Now I shift my analysis to the Presidential Election results at the county level.

County-level data is also pulled from the MIT Election Data and Science Lab.

Cite: MIT Election Data and Science Lab, 2018, “County Presidential Election Returns 2000-2020”, https://doi.org/10.7910/DVN/VOQCHQ, Harvard Dataverse, V9, UNF:6:qSwUYo7FKxI6vd/3Xev2Ng== [fileUNF]

Pull in Presidential Election results by county

Create a voting percentage field in the data set

Generate a table that illustrates the county-level Republican voting percentages

# change names to lower case values along with names
# create a unique county key
county_election_results <- read_csv("voting_data/countypres_2000-2020.csv", col_names = TRUE) %>%
  filter(year == "2020")%>%
  mutate(pct_votes = as.numeric(round(((candidatevotes/totalvotes)*100), digits = 3 ))) %>%
  mutate(county_name = tolower(county_name)) %>%
  rename(subregion = county_name)%>%
  rename(region = state) %>%
  mutate(region = tolower(region)) %>%
  # need to create unique county name. Concatenated state and county
  unite(countyID, region, subregion, sep = "_", remove = FALSE)

# narrow to just voting percentage by Republicans by county
# filter the data for the totals of votes for the republican candidate
# mode data segments by total, early, mail, election day.  Filter to collect totals
county_republican_voting <- county_election_results %>%
  select(2, 6, 9:11, 13, 14) %>%
  filter(party == "REPUBLICAN") 

Let’s generate a choropleth map at the county level providing same perspective as we did for the states.

Join the Republican county voting table to the county map table for choropleth generation

county_usa_republican_tbl <- county_usa_tbl %>%
  left_join(county_republican_voting , by = "countyID")

county_usa_republican_tbl
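A left join leaves NA wherever a key fails to match, so it can be worth checking for unmatched keys before plotting. The sketch below uses anti_join on toy tables. The table names, the spelling difference and the percentages are all hypothetical, chosen only to illustrate the check.

```r
library(dplyr)

# Toy stand-ins (hypothetical rows) for the map keys and the election keys
toy_map_keys <- tibble(
  countyID = c("georgia_clay", "iowa_polk", "illinois_st clair")
)
toy_vote_keys <- tibble(
  countyID  = c("georgia_clay", "iowa_polk", "illinois_st. clair"),
  pct_votes = c(10, 20, 30)  # made-up values
)

# Map keys with no partner in the election table would get NA after a left join
unmatched <- toy_map_keys %>%
  anti_join(toy_vote_keys, by = "countyID")

unmatched$countyID
```

Running the same check on county_usa_tbl and county_republican_voting would list any counties whose names are spelled differently in the two sources.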

Let’s create a choropleth of the County Voting totals for Republicans similar to what we did for the states.

There is an awful lot of dark blue (0%) over several states.

county_usa_republican_tbl
# create a subregion to categorize choropleths

county_usa_republican_tbl %>%
  ggplot(aes(long, lat, group = region)) +
  geom_map(
    aes(map_id = region),
    map = county_usa_tbl,
    color = "gray80", fill = "gray30", size = 0.3
    ) +
  coord_map("ortho", orientation = c(38, -98, 0)) +
  geom_polygon(aes(group = group, fill = pct_votes), color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 50) +
  theme_minimal() +
  labs(title = "County Republican Voting Percentages in 2020",
       x = "", y = "", fill = ""
       ) +
  theme(
    plot.title = element_text(size = 15, face = "bold", color = "red"),
    legend.position = "bottom"
  )

Something looks strange with the choropleth map: there are a lot of dark blue counties. Dark blue sits at the low end of the scale, a Republican share near 0%, for the counties in those states. This certainly warrants further investigation.

Discovery: In this database, not every state reported county results with a “TOTAL” mode. Approximately 10 states reported county votes only as a breakdown by mode. Their counties are shown in blue and have not been included correctly in the choropleth data. Utah is the exception: it reported both TOTAL and per-mode rows. Therefore we need to split the data into counties_with_totals and counties_without_totals.

Note: mode variable

  • Description: mode of ballots cast; default is TOTAL, with different modes specified for 2020
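One way to find which states report with a TOTAL row, by mode, or both is to summarise the mode column per state. This is a minimal sketch on toy rows, not the check actually used in this analysis. "XX" is a hypothetical placeholder state, "EARLY" is an assumed mode label, and reporting_style is a name introduced here.

```r
library(dplyr)

# Toy rows: GA reports by mode only, UT reports both, XX (placeholder) reports totals only
toy_results <- tibble(
  state_po = c("XX", "XX", "GA", "GA", "UT", "UT"),
  mode     = c("TOTAL", "TOTAL", "ABSENTEE", "ELECTION DAY", "TOTAL", "EARLY")
)

# One row per state, flagging which kinds of rows it contains
reporting_style <- toy_results %>%
  group_by(state_po) %>%
  summarise(
    has_total = any(mode == "TOTAL"),
    has_modes = any(mode != "TOTAL"),
    .groups = "drop"
  )

reporting_style
```

States with has_total equal to FALSE need their totals rebuilt from the modes, and states with both flags TRUE (Utah in the real data) need special handling.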

We may have to use some tidyr manipulation of the data tables.

We may also need to segment the data and recalculate the total votes.

Here we divide the election tables into two subsets.

One table holds the results that have a “TOTAL” mode and the other holds those that do not.

# Break down into the different ways the counties reported votes

counties_with_totals <- county_election_results %>%
    filter(mode == "TOTAL") 


# use tidyr to pivot wide (one column per mode)
# after the pivot, change NA to 0 (row sums are added in a later step)
counties_without_totals <- county_election_results %>%
    filter(mode != "TOTAL") %>%
    pivot_wider(names_from = mode, values_from = candidatevotes) %>% 
  replace(is.na(.), 0) 

# discovered that UT has both TOTAL and mode fields. Need to remove from data set. Utah will be treated as "Total" county reporter
counties_without_totals <- counties_without_totals %>%
  filter(state_po != "UT") 

counties_without_totals
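To see what the wide reshape does, here is a small sketch on a single toy county. The vote counts are made up, and values_fill = 0 is used here in place of the separate replace() step. Unlike the real table, the toy table has no per-row pct_votes column, so every remaining column is constant within the county and the rows collapse into one.

```r
library(dplyr)
library(tidyr)

# Toy rows (made-up counts): one county, one party, three modes
toy_modes <- tibble(
  countyID       = "georgia_appling",
  party          = "REPUBLICAN",
  mode           = c("ABSENTEE", "ADVANCED VOTING", "ELECTION DAY"),
  candidatevotes = c(100, 300, 200),
  totalvotes     = 1000
)

# Spread the modes into columns, then add up the mode columns row by row
toy_wide <- toy_modes %>%
  pivot_wider(names_from = mode, values_from = candidatevotes, values_fill = 0) %>%
  mutate(
    total_county_vote = rowSums(across(c(ABSENTEE, `ADVANCED VOTING`, `ELECTION DAY`))),
    pct_county_vote   = round(total_county_vote / totalvotes * 100, 3)
  )

toy_wide
```

Selecting the mode columns by name rather than by position keeps the row sum correct even if the column order changes.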

We now need to restructure the county election table to account for the different modes of voting

We use the tidyr function pivot_wider to reshape the table so that each of the “modes” becomes its own column. Then we can summarise the vote percentages. We also have to group the percentages into a single county total, similar to the way the “TOTAL” mode is structured.

# Reshape table pivoting on modes
# use tidyr to pivot wide (one column per mode)
# after pivot change NA to 0 and sum rows to new variable
# sum up total county votes and add new percentage variable
# select only those records from counties that report modes
counties_without_totals_reshaped <- counties_without_totals %>% 
    mutate(total_county_vote = rowSums(.[13:27])) %>%
    mutate(pct_county_vote = as.numeric(round(((total_county_vote/totalvotes)*100), digits = 3 ))) 
  

# Use group_by() and summarise() to sum the votes per party and county
counties_without_totals_reshaped_group <- counties_without_totals_reshaped %>%
  group_by(party, countyID) %>%
  summarise(total_county_vote = sum(total_county_vote), .groups = "drop")



# narrow to just voting percentage by Republicans by county
# filter the data for the totals of votes for the republican candidate
# mode data segments by total, early, mail, election day.  Filter to collect totals
county_republican_voting_reshaped <- counties_without_totals_reshaped %>%
  select(2, 6, 9:10, 13, 28:29) %>%
  filter(party == "REPUBLICAN") %>%
  group_by(countyID, party) %>%
  summarise(sum_cty_pct = sum(pct_county_vote))

# create a grouped view of the Republican county data
county_republican_voting_reshaped_group <- county_republican_voting_reshaped



# join with the county map table
county_usa_republican_tbl_reshaped <- county_usa_tbl %>%
  left_join(county_republican_voting_reshaped_group , by = "countyID") %>%
  filter(party == "REPUBLICAN")

county_usa_republican_tbl_reshaped
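As an alternative to the pivot, the same county percentage can be computed by summing the votes over the modes directly. This is a sketch of that alternative on toy data, not the approach used above. The counts are made up, and it assumes totalvotes is repeated identically on every row of a county.

```r
library(dplyr)

# Toy rows (made-up counts): two counties, two parties, two modes each
toy_long <- tibble(
  countyID       = rep(c("iowa_adair", "iowa_adams"), each = 4),
  party          = rep(c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN"), times = 2),
  mode           = rep(c("ABSENTEE", "ELECTION DAY"), times = 4),
  candidatevotes = c(30, 10, 20, 40, 5, 20, 10, 15),
  totalvotes     = rep(c(100, 50), each = 4)
)

# Sum the Republican votes across modes, then divide by the county total
toy_republican <- toy_long %>%
  filter(party == "REPUBLICAN") %>%
  group_by(countyID, party) %>%
  summarise(
    sum_cty_pct = round(sum(candidatevotes) / first(totalvotes) * 100, 3),
    .groups = "drop"
  )

toy_republican
```

Dividing once after summing avoids accumulating rounding error from adding several rounded per-mode percentages.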

The munged data now gives a grouped picture of the counties that report by mode (the counties without a “TOTAL” row).

We can now draw a choropleth of the counties in states that tally votes by mode, and use ggplot2 to see whether the data looks improved.

# create a region to categorize choropleths

county_usa_republican_tbl_reshaped %>%
  ggplot(aes(long, lat, group = group)) +
  geom_map(
    aes(map_id = region),
    map = county_usa_tbl,
    color = "gray80", fill = "gray30", size = 0.3
    ) +
  coord_map("ortho", orientation = c(38, -98, 0)) +
  geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 50) +
  theme_minimal() +
  labs(title = "County Republican Voting Percentages in 2020",
       x = "", y = "", fill = ""
       ) +
  theme(
    plot.title = element_text(size = 15, face = "bold", color = "red"),
    legend.position = "bottom"
  )

So let’s look at the states whose counties reported their ballots differently.

Georgia

Start with a review of the Republican county results. I’ve included a snapshot of all 159 counties in Georgia.

georgia_republican_county_results <- county_usa_republican_tbl_reshaped %>%
    filter(region == "georgia") 

georgia_republican_county_results

Note: Counting is grouped by mode. Again, there is no “Total” mode for Georgia

georgia_republican_county_results
# create a subregion to categorize choropleths

georgia_republican_county_results %>%
  ggplot(aes(long, lat, group = countyID)) +
  geom_map(
    aes(map_id = subregion),
    map = county_usa_tbl,
    color = "gray80", fill = "gray30", size = 0.3
    ) +
  coord_map("ortho", orientation = c(38, -98, 0)) +
  geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 50) +
  theme_minimal() +
  labs(title = "Georgia Republican Voting Percentages in 2020",
       x = "", y = "", fill = ""
       ) +
  theme(
    plot.title = element_text(size = 15, face = "bold", color = "red"),
    legend.position = "bottom"
  )

Note: the charts below do not use county totals; they show the breakdown by mode. This part still needs work.

Break data into major party and mode of voting

Voting mode takes the values Absentee, Advanced Voting, Election Day, and Provisional. For each mode, pct_votes is the candidate's votes cast by that mode as a percentage of the county's total votes (candidatevotes / totalvotes × 100).

georgia <- county_election_results %>%
  filter(state_po == "GA") %>%
  filter(party == "DEMOCRAT" | party == "REPUBLICAN") %>% 
  select(5, 9:11, 13:14)

georgia

Georgia Ballot Modes

ga_absentee <- georgia %>%
  filter(mode == "ABSENTEE")%>% 
  arrange(subregion) 


# Grouped
ggplot(ga_absentee, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Absentee Voting", 
           subtitle="Georgia Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(x = "Counties (All 159)", y = "Percent of Vote")

# ga_absentee


ga_advanced <- georgia %>%
  filter(mode == "ADVANCED VOTING")%>% 
  arrange(subregion) 

# Grouped
ggplot(ga_advanced, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Advanced Voting", 
           subtitle="Georgia Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  labs(x = "Counties (All 159)", y = "Percent of Vote")

# ga_advanced


ga_election_day <- georgia %>%
  filter(mode == "ELECTION DAY")%>% 
  arrange(subregion ) 


# Grouped
ggplot(ga_election_day, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Election Day Voting", 
           subtitle="Georgia Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  labs(x = "Counties (All 159)", y = "Percent of Vote")

# ga_election_day


ga_provisional <- georgia %>%
  filter(mode == "PROV")%>% 
  arrange(desc(candidatevotes)) 



# Grouped
ggplot(ga_provisional, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Provisional Voting", 
           subtitle="Georgia Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  labs(x = "Counties (All 159)", y = "Percent of Vote")

# ga_provisional

Let’s take a look at the Iowa data.

Iowa

Here again the choropleth shows a bluish color, indicating that the data may not be correct.

As it turns out, this is another state that reported by mode. In this case there were only two modes: election day and absentee.

iowa <- county_election_results %>%
  filter(state_po == "IA") %>%
  filter(party == "DEMOCRAT" | party == "REPUBLICAN") %>% 
  select(5, 9:11, 13:14)

iowa
iowa_republican_county_results <- county_usa_republican_tbl_reshaped %>%
    filter(region == "iowa") 

# iowa_republican_county_results


# create a subregion to categorize choropleths

iowa_republican_county_results %>%
  ggplot(aes(long, lat, group = countyID)) +
  geom_map(
    aes(map_id = subregion),
    map = county_usa_tbl,
    color = "gray80", fill = "gray30", size = 0.3
    ) +
  coord_map("ortho", orientation = c(38, -98, 0)) +
  geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 50) +
  theme_minimal() +
  labs(title = "Iowa Republican Voting Percentages in 2020",
       x = "", y = "", fill = ""
       ) +
  theme(
    plot.title = element_text(size = 15, face = "bold", color = "red"),
    legend.position = "bottom"
  )

Iowa Ballot Modes

ia_absentee <- iowa %>%
  filter(mode == "ABSENTEE")%>% 
  arrange(subregion) 


# Grouped
ggplot(ia_absentee, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Absentee Voting", 
           subtitle="Iowa Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(x = "Counties (All 99)", y = "Percent of Vote")

# ia_absentee


ia_election_day <- iowa %>%
  filter(mode == "ELECTION DAY")%>% 
  arrange(subregion ) 


# Grouped
ggplot(ia_election_day, aes(fill=party, y=pct_votes, x=subregion)) + 
  geom_bar(position="dodge", stat="identity")+
  scale_fill_manual(values=c("blue",
                             "red")) +
      labs(title="Election Day Voting", 
           subtitle="Iowa Presidential Data", 
           caption="Source: County Presidential Election Data") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  labs(x = "Counties (All 99)", y = "Percent of Vote")

# ia_election_day